Avoiding Plagiarism in Markov Sequence Generation
Abstract
Markov processes are widely used to generate sequences that imitate a given style, using random walk. Random walk generates sequences by iteratively concatenating states to prefixes of length at most the given Markov order. However, at higher orders, Markov chains tend to replicate chunks of the corpus of a size possibly larger than the order, a primary form of plagiarism. The Markov order defines a maximum length for training but not for generation. In the framework of constraint satisfaction (CSP), we introduce MAXORDER. This global constraint ensures that generated sequences do not include chunks larger than a given maximum order. We exhibit an automaton that recognises the solution set, with a size linear in the size of the corpus. We propose a linear-time procedure to generate this automaton from a corpus and a given max order. We then use this automaton to achieve generalised arc consistency for the MAXORDER constraint, holding on a sequence of size n, in O(n · T) time, where T is the size of the automaton. We illustrate our approach by generating text sequences from text corpora with a maximum order guarantee, effectively controlling plagiarism.

Introduction

Markov chains are a powerful, widely used technique to analyse and generate sequences that imitate a given style (Brooks et al. 1957; Pinkerton 1956), with applications to many areas of automatic content generation, such as music, text, line drawing, and more generally any kind of sequential data. A typical use of such models is to generate novel sequences that "look" or "sound" like the original.

From a corpus of finite-length sequences considered representative of the style of an author, a Markov model of the style is estimated based on the Markov hypothesis, which states that the future state of a sequence depends only on its last state, i.e.:

p(s_{i+1} \mid s_1, \ldots, s_i) = p(s_{i+1} \mid s_i).

The equation above describes a Markov model of order 1. The definition extends to higher orders by conditioning on a prefix of length k greater than 1:

p(s_{i+1} \mid s_1, \ldots, s_i) = p(s_{i+1} \mid s_{i-k+1}, \ldots, s_i).

In theory, higher-order Markov models are equivalent to order-1 models. In practice, however, higher-order models offer a better compromise between expressivity and representation cost. Variable-order Markov models are often used to produce sequences with varying degrees of similarity to the corpus (Begleiter, El-Yaniv, and Yona 2004). Indeed, increasing the Markov order produces sequences that replicate larger chunks of the original corpus, thereby improving the impression of style imitation. However, it has also long been observed (Brooks et al. 1957) that increasing the order tends to produce sequences containing chunks of the corpus much larger than the Markov order.
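Before turning to the illustration, a minimal sketch may make the setup concrete. Assuming a tokenised corpus, the Python below estimates an order-k model as a prefix-to-continuations table and generates by random walk; the names (train_markov, random_walk) and the toy corpus are illustrative, not the authors' code.

```python
import random
from collections import defaultdict

def train_markov(tokens, k):
    """Estimate an order-k Markov model: map each k-token prefix to
    the list of tokens that follow it somewhere in the corpus."""
    model = defaultdict(list)
    for i in range(len(tokens) - k):
        model[tuple(tokens[i:i + k])].append(tokens[i + k])
    return model

def random_walk(model, k, length, seed=None):
    """Generate by iteratively sampling a continuation of the current
    k-token suffix, i.e. plain Markov random-walk generation."""
    rng = random.Random(seed)
    sequence = list(rng.choice(list(model)))  # random starting prefix
    while len(sequence) < length:
        continuations = model.get(tuple(sequence[-k:]))
        if not continuations:  # dead end: suffix never seen in training
            break
        sequence.append(rng.choice(continuations))
    return sequence

corpus = "the cat sat on the mat and the cat ran after the dog".split()
print(" ".join(random_walk(train_markov(corpus, 2), 2, 12, seed=0)))
```

Note that nothing in this generator bounds how much of the corpus is copied: each step only looks at the last k tokens, so successive steps can chain together into verbatim chunks far longer than k.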
We illustrate this phenomenon on a text corpus: Johnston's English translation of Pushkin's Eugene Onegin, a reference to Markov, who used the same corpus (in Russian) for his pioneering studies. Here, an element of the Markov chain is a word of the text or a sentence separator, and a sequence is a succession of such elements. With a Markov order of 1, we obtain the following sequence:

Praskovya re-baptized “Polina”. Walking her secret tome that rogue, backbiter, pantaloon, bribe-taker, glutton and still eats, and featherbeds, and enjoyment locked him all went inside a day wood below the flower was passion and theirs was one who taught her handkerchief has measured off in caravan the finest printer with pulses racing down, he’ll be nothing could draw it abounded.

On top of the text, we draw the longest subsequences that appear verbatim in the corpus, or chunks, assigning different colours to different lengths. For example, this generated sequence contains the chunk “[...] that rogue, backbiter, pantaloon, bribe-taker, glutton and [...]”, which is a subsequence of length 7 from the corpus. The maximum order of a sequence is the length of its longest chunk.
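The maximum order of a generated sequence can be checked directly, if inefficiently, with a quadratic scan over its chunks; the paper's contribution is to enforce a bound on it during generation via a linear-size automaton, not to measure it after the fact. A sketch follows, reusing corpus, train_markov, and random_walk from the previous example; max_order is a hypothetical helper, not the paper's algorithm.

```python
def max_order(sequence, corpus_tokens):
    """Length of the longest contiguous chunk of `sequence` occurring
    verbatim in `corpus_tokens`. Naive quadratic scan; the paper instead
    builds a linear-size automaton to enforce a bound during search."""
    def occurs(chunk):
        k = len(chunk)
        return any(tuple(corpus_tokens[i:i + k]) == chunk
                   for i in range(len(corpus_tokens) - k + 1))
    best = 0
    for i in range(len(sequence)):
        # A chunk only beats `best` if it is longer, and if sequence[i:j]
        # is absent from the corpus, so is every extension of it.
        j = i + best + 1
        while j <= len(sequence) and occurs(tuple(sequence[i:j])):
            best = j - i
            j += 1
    return best

seq = random_walk(train_markov(corpus, 1), 1, 15, seed=2)
print(max_order(seq, corpus))  # frequently exceeds the training order 1
```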
Similar resources
Cryptomnesia: Delineating Inadvertent Plagiarism
Cryptomnesia, or inadvertent plagiarism, was experimentally examined in three investigations. Subjects were required to generate category exemplars, alternating with 3 other subjects in Experiments 1 and 2 or with a standardized, written list in Experiment 3. After this generation stage, subjects attempted to recall those items which they had just generated and an equal number of completely new...
No evidence of age-related increases in unconscious plagiarism during free recall.
In three experiments younger and older participants took part in a group generation task prior to a delayed recall task. In each, participants were required to recall the items that they had generated, avoiding plagiarism errors. All studies showed the same pattern: older adults did not plagiarise their partners any more than younger adults did. However, older adults were more likely than young...
Avoiding plagiarism, self-plagiarism, and other questionable writing practices: A guide to ethical writing
In recognizing the importance of educating aspiring scientists in the responsible conduct of research (RCR), the Office of Research Integrity (ORI) began sponsoring the creation of instructional resources to address this pressing need in 2002. The present guide on avoiding plagiarism and other inappropriate writing practices was created to help students, as well as professionals, identify and p...
Writing an Abstract
The opportunity to design and deliver short programs on referencing and avoiding plagiarism for transnational UniSA students has confirmed the necessity of combating both the ‘all-plagiarism-is-cheating’ reaction and the ‘just-give-them-a-referencing-guide’ response. The notion of referencing is but the tip of a particularly large and intricate iceberg. Consequently, ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
Publication date: 2014